ABSTRACT

SHAPE (Selective 2'-hydroxyl acylation analysed
by primer extension) technology has emerged as 
one of the leading methods of determining RNA 
secondary structure at the nucleotide level. A sig- 
nificant bottleneck in using SHAPE is the complex 
and time-consuming data processing that is 
required. We present here a modified data collection 
method and a series of algorithms, embodied in a 
program entitled Fast Analysis of SHAPE traces 
(FAST), which significantly reduces processing 
time. We have used this method to resolve the sec- 
ondary structure of the first 900 nt of the hepatitis 
C virus (HCV) genome, including the entire core 
gene. We have also demonstrated the ability of 
SHAPE/FAST to detect the binding of a small 
molecule inhibitor to the HCV internal ribosomal 
entry site (IRES). In conclusion, FAST allows for 
high-throughput data processing to match the 
current high-throughput generation of data 
possible with SHAPE, reducing the barrier to 
determining the structure of RNAs of interest. 

INTRODUCTION 

An understanding of the biological role of viral RNA is 
facilitated by knowledge of its higher order structure. 
Selective 2'-hydroxyl acylation analysed by primer exten- 
sion (SHAPE) has emerged as one of the most robust and 
well-characterized methods of mapping RNA secondary 
structure at single-nucleotide resolution. SHAPE has been 
used to map the structure of RNAs as large and diverse as 
the ribosome and the HIV genome (1-4). 

In SHAPE, the RNA of interest is chemically 
interrogated by addition of an electrophile, such as 
N-methylisatoic anhydride (NMIA), which preferentially 



acylates conformationally flexible nucleotides at the ribose 
2'-OH position (5). These 2'-O-adducts result in termination
events during subsequent reverse transcription (RT).
The result is a pool of DNA fragments whose lengths 
correspond to the location of flexible nucleotides. The 
use of fluorescent-labelled primers allows these fragments 
to be resolved by capillary electrophoresis (6). In SHAPE, 
for every experimental condition, an experimental dataset, 
a control dataset, and one to two sequencing ladders 
need to be generated. Thus, three to four electrophoretic 
traces are generated for each region of RNA, which must 
be processed in order to calculate the SHAPE reactivity of 
each nucleotide in the RNA of interest. This data processing
is challenging and laborious because neither the y-axis
scale nor the x-axis scale is the same for each trace (3).

Each trace has a different x-axis scale because each 
fluorophore migrates through the capillary matrix at a
different rate (3). Each fluorophore also interacts with
the 5' terminal nucleotides of the DNA fragment to
which it is attached; consequently, both the identity of
the fluorophore and its adjacent sequence affect the rate
of DNA fragment migration. In DNA sequencing, the fact 
that only one peak should occur at each location allows 
for algorithmic correction. In SHAPE, the data lacks such 
registry information: the control and experimental (and 
ladder) peaks can and often should occur at the same 
location (3). 

Two factors predominantly govern why each trace has a 
different y-axis scale. First, individual fluorophores have
different spectral properties (3). Thus, a DNA fragment 
with the same concentration but labelled with two differ- 
ent fluorophores will have two different fluorescent 
intensities. Second, human error inevitably introduces 
variances in RNA (and later DNA) concentrations 
during sample processing. 

SHAPEfinder is a program that was written in order to
address these scaling issues (3). This program allows the 
user to visually adjust traces, allowing for normalization 

Table 1. Comparison between FAST and SHAPEfinder

                                                 FAST                       SHAPEfinder
Processing corrections
  Baseline adjustments                           Automated (a)              Automated
  Integration of peak areas                      Automated (a)              Automated
  Fluorophore spectral overlap                   Less important             Automated
  Fluorophore-specific mobility shifts (x-axis)  Not needed                 Partially automated
  Peak identification                            Partially automated        Partially automated
  Signal decay                                   Automated                  Visual manipulation
  Signal scaling (y-axis)                        Automated                  Visual manipulation
Notable characteristics
  Electrophoresis technology                     Fragment analysis          DNA sequencing
  Time required (expert)                         5-15 min                   1-2 h
  Specific mobility file for each region of RNA  No                         Yes
  Cost                                           1 labelled primer/300 nt   4 labelled primers/600 nt
  Read length (nt)                               ~250-350                   ~600
  Sequencing ladder                              Once per RNA region        Needed for every run
  Peak identification                            Once per RNA region        Needed for every run

(a) Baseline adjustment and peak area integration are performed initially using PeakScanner.



of traces to a common x- and y-axis. Visual manipulation,
however, is a time-consuming and user-specific process. 

Our goal was therefore to generate a SHAPE system
that would eliminate or automate the x- and y-axis
scaling otherwise done by visual inspection. Here, we describe a
Windows program, entitled FAST, to address this goal.
To enhance its accessibility, FAST has been encoded 
as a module within the widely used program, 
RNAstructure (7). 

FAST builds upon the idea of using the same 
fluorophore-labelled primer for all arms of a study, as 
first introduced in the program CAFA (8). Using the 
same fluorophore-labelled primer holds the migration 
offset introduced by the fluorophore constant among all 
conditions. Consequently, x-axis scaling is not needed. 
With respect to y-axis scaling, the need to correct for
spectral differences among fluorophores is also eliminated
by the use of the same fluorophore-labelled primer. In this
manner, we have reduced the scaling problem down to
correcting for y-axis differences that result from
handling error.

In order to use one fluorophore for all arms of a study, 
capillary to capillary comparisons must be possible. 
Capillary to capillary comparisons are made possible by 
using an internal standard — a technique common to 
fragment analysis (9). The use of an internal standard to 
normalize data among capillaries has additional advan- 
tages: the control arm of the study, and any associated 
sequencing ladders, need to be generated only once, 
rather than needing to be generated repeatedly along 
with each and every experimental arm (4). 

The most crucial step in the SHAPE process is proper 
alignment between the peaks found in a trace and the 
RNA sequence — i.e. assigning a nucleotide position to 
each peak. We refer to this process as peak identification. 
In SHAPEfinder, this process is highly dependent on 
the user correctly performing x-axis calibrations. 
Furthermore, because the original SHAPE method does 



not allow for traces from different capillaries to be 
compared, calibration must be performed for every con- 
dition in an experiment. In contrast, FAST allows the 
same calibration file to be used repeatedly for all subse- 
quent analyses of the same RNA region. 

Table 1 summarizes some of the differences between 
FAST and SHAPEfinder. PeakScanner is a freely avail- 
able software program by Applied Biosystems, Inc (ABI), 
which we use to integrate the area of peaks in a trace. As 
noted previously, CAFA also uses a single fluorophore for 
RNA probing experiments, but it lacks the automated 
y-axis scaling and peak identification algorithms that
make up the heart of the FAST program (8). Here, we 
demonstrate the utility of FAST by applying it to the 
analysis of the 5' end of the hepatitis C virus (HCV) 
RNA genome. This segment encodes a critically important 
region of RNA secondary structure, the internal ribosome 
entry site (IRES) — several stretches of which have been 
solved by high-resolution nuclear magnetic resonance 
(NMR) and crystallography that provide for ideal 'gold 
standards' against which to compare FAST-generated 
results — and the HCV core protein. We first examine the 
IRES secondary structure by FAST, and then assess how 
it changes in response to addition of a small molecule 
ligand. Finally, we use SHAPE/FAST to resolve the 
RNA secondary structure downstream of the IRES, the 
core-encoding region of the HCV genome, which has 
previously been suggested to contain numerous RNA 
structures, some of which have been implicated in 
packaging. 



METHODS 

The DNA polymerase chain reaction (PCR) template for
the genotype (GN) 1b/con1 5'UTR RNA was generated
from Bart79I (10). The DNA PCR template for the GN2/
J6 RNA was generated from FL-J6/JFH (11). RNAs were
generated by in vitro transcription, using T7 MEGAscript.






Nucleic Acids Research, 2011, Vol. 39, No. 22 e151



RNAs for SHAPE were purified by MEGAclear, with 
purity and length verified by capillary electrophoresis. 
HCV RNA was folded (100 mM NaCl; 2.5 mM MgCl2;
65°C × 1', 5' cooling at room temperature, 37°C for
20-30') as previously described (12), but in 100 mM
HEPES, pH = 8.

2' acylation with NMIA (5) and reverse transcription
(RT) primer extension were performed at 45°C × 1',
52°C × 25', 65°C × 5', as previously described (4). 6FAM
was used for all labelled primers (see Supplementary
Data for a list of primers). Exceptions to these protocols
were as follows: (i) RNA purification after acylation, as
well as removal of microRNA (miRNA) before RT (as
appropriate), was performed using RNA C&C columns
(Zymo Research), rather than ethanol precipitation;
(ii) before and after SHAPE primer buffer was added,
the mixture was placed at room temperature for
2-5 min, which enhanced RT yields significantly;
(iii) DNA purification was performed using
Sephadex G-50 size exclusion resin in 96-well format,
followed by concentration by vacuum centrifugation, resulting in
a more complete removal of primer; and (iv) 1 pmol
rather than 3 pmol of RNA was used in ddGTP RNA
sequencing reactions.

Compound S-1 (kindly provided by Dr. Thomas
Hermann) binding experiments were carried out as
follows: After folding the GN1b RNA, S-1 was added to
the RNA at the following concentrations: 200 µM,
126.4 µM, 40 µM, 12.66 µM, 4 µM, 1.26 µM, 400 nM,
126.4 nM and 40 nM, and allowed to incubate for ~10 min.
Pearson's correlation coefficient (σ) was used to examine
the correlation between S-1 concentration and SHAPE
reactivity, for each nucleotide.

The ABI 3100 Genetic Analyzer (50 cm capillaries filled
with POP6 matrix) was set to the following parameters:
voltage = 15 kV, temperature = 60°C, injection time = 15 s. The
GeneScan program was used to acquire the data for
each sample, which consisted of purified DNA resuspended
in 9.75 µl of Hi-Di formamide, to which 0.25 µl
of ROX 500 internal size standard (ABI Cat. 602912)
was added.

PeakScanner was run with the following
parameters: smoothing = none; window size = 25; size
calling = Local Southern; baseline window = 51; peak
threshold = 15. Fragments 250 and 340 were computationally
excluded from the ROX500 standard (13).
RNAstructure parameters: slope and intercept parameters
of 2.6 and -0.8 kcal/mol were initially tried, as suggested
(14); however, we found that intercepts smaller in magnitude,
closer to 0.0 kcal/mol (e.g. ~-0.3), produced fewer
suboptimal structures (within a maximum energy difference
of 10%). We speculate that this minor parameter difference
may be due to the precise fitting achieved between
experimental and control data sets by the automated
FAST algorithm. FAST was written in ANSI C/C++
and the code is available upon request; in its current
implementation, FAST is integrated into RNAstructure,
which requires MFC (Microsoft Foundation Classes).
RNA structures were drawn and coloured using
RnaViz 2 (15).
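
For orientation, a slope m and intercept b of this kind enter the standard SHAPE pseudo-free energy term, ΔG_SHAPE(i) = m ln(reactivity_i + 1) + b. The following is a minimal sketch of that term only; the function name and the handling of missing or negative reactivities are illustrative assumptions and do not describe the internals of FAST or RNAstructure.

```python
import math

def shape_pseudo_energy(reactivity, slope=2.6, intercept=-0.8):
    """Pseudo-free energy (kcal/mol) for one nucleotide: m * ln(r + 1) + b.

    Returns 0.0 when no reactivity is available (None); this is an
    assumption of the sketch, not documented program behaviour.
    """
    if reactivity is None:
        return 0.0
    # Clamp negative reactivities to zero so the logarithm stays defined
    # (a simplifying assumption for this illustration).
    r = max(reactivity, 0.0)
    return slope * math.log(r + 1.0) + intercept
```

With these defaults an unreactive nucleotide (reactivity 0) receives the intercept itself, so moving the intercept from -0.8 towards -0.3 directly reduces the bonus given to unreactive positions.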



RESULTS AND DISCUSSION 

Automated y-axis scaling

The various processing steps inherent to the SHAPE 
chemical probing method inevitably result in undesired 
differences among samples. In FAST, these differences 
are automatically corrected for in an algorithm referred 
to as load error correction. 

The goal of load correction is to normalize two traces to 
one another, by scaling the traces such that the 
low-reactive nucleotides in the two traces overlap. This 
requires distinguishing low-reactive nucleotides from 
high-reactive nucleotides, which result in peaks of 
interest. In SHAPEfinder, the user attempts to visually 
overlap the least reactive 5-10% of peaks in the control 
arm to those in the experimental arm. 

The challenge of visually manipulating one trace such 
that its least-reactive peaks overlap with the least-reactive 
peaks in a second trace is exemplified in experiment 1 
(Figure 1A) and experiment 2 (Figure 1B). In experiment
1, the control arm has been deliberately overloaded with 
respect to the experimental arm, resulting in the area of 
each peak in the control data being, on average, higher 
than the peak area in the experimental data. Visually, 
because the control arm data are so prominent, a user 
might conclude that the data are too noisy to be meaning- 
ful. In experiment 2, the control arm has been diluted 
down with respect to the experimental arm. Visually, 
because the control arm looks virtually flat, the user 
may be tempted to assume that no gain adjustment is ne- 
cessary. [In CAFA, there does not appear to be a normal- 
ization step: instead, an arbitrary cutoff of a 3-fold 
change over the mean background indicates positions of 
concern (8).] 

In FAST, y-axis scaling is conceptualized not as a
process of overlapping two traces, but instead as a 
process of normalizing the distribution of peak areas in 
the control data to the distribution of peak areas in the 
experimental data (Figure 1C and D). This makes scaling 
relatively trivial. A scaling factor is computed that in- 
creases or decreases the median of the experimental data 
such that it matches the median of the control data. The 
power of this method is illustrated in Figure 1E. Despite
the notable difference in absolute and relative peak areas
for the control and experimental data sets in experiment 1
compared to the data in experiment 2, each experiment
has captured the same SHAPE reactivity data, as evidenced
by the overlapping traces. No user visualization or ma- 
nipulation is required. If more than one experimental 
arm exists, all arms can be normalized to the control 
arm using this method — this results in the same scale for 
all experimental conditions tested. Practically, high load 
correction factors >2 indicate that the data are of limited 
quality; conversely, overruling of high quality data can be 
avoided by excluding highly reactive nucleotides in the 
experimental trace from the process by which the scaling 
factor is determined (see Supplementary Data for a 
detailed description of this process). 

Thus, this algorithm enables correction along the y-axis
for any differences that may result from electrokinetic
injection, loss of sample or loading error.
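
The median-matching idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not the FAST source: the function names are hypothetical, and the `exclude_top_fraction` parameter is only a stand-in for the exclusion of highly reactive nucleotides, whose actual procedure is given in the Supplementary Data.

```python
from statistics import median

def load_correction_factor(control_areas, experimental_areas,
                           exclude_top_fraction=0.0):
    """Factor that scales experimental peak areas so that their median
    matches the median of the control peak areas."""
    exp_sorted = sorted(experimental_areas)
    # Optionally ignore the largest (most reactive) experimental peaks
    # when computing the median; always keep at least one peak.
    n_drop = int(len(exp_sorted) * exclude_top_fraction)
    kept = exp_sorted[:max(1, len(exp_sorted) - n_drop)]
    return median(control_areas) / median(kept)

def apply_load_correction(control_areas, experimental_areas, **kwargs):
    """Return (scaled experimental areas, scaling factor)."""
    factor = load_correction_factor(control_areas, experimental_areas, **kwargs)
    return [area * factor for area in experimental_areas], factor
```

For example, a control trace with median 30 and an experimental trace with median 15 give a factor of 2, which by the rule of thumb in the text would sit at the boundary where data quality becomes questionable.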
Figure 1. Load error correction. (A) Experiment 1, in which the control sample concentration is greater than the experimental sample concentration
(~2×). (B) Experiment 2, in which the control sample concentration has been decreased relative to the experimental sample concentration (~1/2×).
(C, D) Peak area traces are transformed into box-plot distributions, which are then normalized by their median values. After normalization, the
resultant SHAPE reactivity traces (E) demonstrate that the two experiments have captured the same experimental data, despite their differing visual
appearance.



Signal decay correction 

An overall trend is sometimes observed in electrophoretic 
traces: shorter fragments have higher signals, on average, 
than longer fragments. This phenomenon is called 'signal 
decay' (6). One reason for signal decay is that, while the 
probing reaction with the electrophile is carried out such 
that any given RNA molecule is modified only once, 
multiple hits still do occur. As a reverse transcriptase 
will pause at any first modification, this biases the distri- 
bution of fragments towards shorter ones. Another reason 
is the imperfect processivity of the SuperScript III reverse
transcriptase (3). Additionally, we have observed that salt
and primer concentration appear to affect signal decay.



Desalting and removal of the primer by size exclusion 
chromatography appear to significantly reduce the decay 
rate. We suspect that one explanation for this observation 
is that DNA samples are taken up into the capillary by 
electrokinetic injection, a process that is dependent 
on solvent conductivity and solute electrophoretic 
mobility (9). 

Weeks and colleagues observed that signal decay can be 
modelled as (6) 



$$ D = A \cdot P^{\,\text{elution time}} + C \qquad (1) $$



In this equation, D is the correction factor by which each 
peak area is divided; A and C are scaling factors for the 
initial and final trace intensities, respectively; and P is the
probability of extension. Since our fragment analysis tech- 
nique assigns a standardized fragment length to each 
peak, elution time in this equation is replaced by the 
fragment length of each peak. 

In SHAPEfinder, the user can rapidly try out different 
values for P, such that after the peak area is divided by 
D (corrected peak area), the corrected peaks at the start, 
middle and end of the trace have similar heights by visual 
inspection. In FAST, we automate this procedure by 
(i) performing a linear fit of corrected peak areas versus 
their location and (ii) empirically determining the value of 
P that minimizes the magnitude of the slope of this linear 
fit. A slope of zero corresponds to a horizontal line, so the
value of P that minimizes the magnitude of the slope is the
one that gives the most even peak intensities
at the beginning, middle and end of the trace.
This empirical process is illustrated graphically
in Figure 2A. As discrete values of P are tested (increments
of 0.001), the absolute value of the slope passes through
a minimum at the point where the slope changes sign from
negative to positive. The slope changes sign because
over-correction inverts the trace, resulting in corrected
peaks at the start of the trace being lower than peaks at
the end of the trace. In the example in Figure 2A, P was
calculated to be 0.964.
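
The scan over P can be sketched as follows. This is a minimal illustration, not the FAST implementation: the function names are hypothetical, fragment length stands in for elution time as described above, and the defaults A = 1 and C = 0 are assumptions made only to keep the example small.

```python
def fit_slope(xs, ys):
    """Least-squares slope of ys versus xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def decay_factor(length, p, a=1.0, c=0.0):
    """Correction factor D = A * P**length + C from equation (1)."""
    return a * p ** length + c

def find_extension_probability(lengths, areas, a=1.0, c=0.0,
                               p_min=0.900, p_max=1.000, step=0.001):
    """Test P in fixed increments and return the value whose corrected
    peak areas give the linear fit with the smallest absolute slope."""
    n_steps = round((p_max - p_min) / step)
    best_p, best_abs_slope = None, float("inf")
    for k in range(n_steps + 1):
        p = round(p_min + k * step, 6)
        corrected = [area / decay_factor(length, p, a, c)
                     for length, area in zip(lengths, areas)]
        abs_slope = abs(fit_slope(lengths, corrected))
        if abs_slope < best_abs_slope:
            best_p, best_abs_slope = p, abs_slope
    return best_p
```

Scanning a fixed grid rather than using a numerical optimizer mirrors the increments of 0.001 described in the text and makes the sign change of the slope easy to inspect.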

Peak identification 

Nucleotide positions are integer values. Thus, in an ideal 
trace, the spacing between successive fragments, as 
denoted by the width of valleys between peaks in the 
electrophoretogram, should all be uniform. In such a 
scenario, fragment lengths, as assigned using an internal 
size standard, would be identical to nucleotide position. 

Band compression describes the irregularities in spacing 
that occur in an actual trace, which are thought to result 
from secondary structure. Band compression results in a 
changing offset between fragment length and nucleotide 
position (Figure 2B). A critical and novel component of 
the FAST algorithm is its ability to algorithmically adjust 
for band compression. 

The input for this algorithm is an RNA sequence and 
the table of peak areas and their fragment lengths 
generated from a ddGTP sequencing ladder. FAST
begins by assigning the nearest G in the RNA sequence
to the peak whose fragment length is closest to the nucleotide
number for that G; assignments that would result
in offsets larger than 5.5 nt are not allowed. This
results in a first gross alignment between peak fragment
length and nucleotide position, as illustrated in
Figure 3A. Notably, we have observed, and the algorithm
assumes, that fragment length is always smaller than its
corresponding nucleotide position.
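
One way to picture the gross alignment is the greedy sketch below. It is an interpretation under stated assumptions rather than the FAST code: peaks are processed in order of increasing fragment length, each G is used at most once, and the function names and example sequence are hypothetical (the fragment lengths loosely follow the values quoted for Figure 3).

```python
def g_positions(sequence):
    """1-based positions of every G in the RNA sequence."""
    return [i + 1 for i, base in enumerate(sequence.upper()) if base == "G"]

def gross_alignment(sequence, fragment_lengths, max_offset=5.5):
    """Assign each ladder peak to the nearest unused G whose position is
    at or above the fragment length and within max_offset nucleotides.
    Returns a list of (fragment_length, nucleotide_position) pairs."""
    free = g_positions(sequence)
    table = []
    for fl in sorted(fragment_lengths):
        # Fragment length is assumed never to exceed nucleotide position.
        candidates = [nt for nt in free if 0 <= nt - fl <= max_offset]
        if not candidates:
            continue  # leave this peak unassigned
        nt = min(candidates, key=lambda pos: pos - fl)
        free.remove(nt)
        table.append((fl, nt))
    return table
```

With G residues at positions 76, 84 and 85, peaks at 71.6, 80.4 and 81.6 are assigned to 76, 84 and 85; if the 80.4 peak is absent, the 81.6 peak is assigned to 84, which is the kind of error the refinement step is meant to catch.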

Further refinement is then necessary because the ladder 
trace generated by performing RT in the presence of 
ddGTP is imperfect, due to missing or aberrant peaks. 
In order to automatically correct for these phenomena, 
FAST takes advantage of the observation that the offset 
(delta) between fragment length (FL) and true nucleotide 
position (NP), Δ(NP-FL), varies in an incremental
Figure 2. (A) Signal decay correction. Graphical depiction of the empirical
determination of P in equation (1). Signal decay in an electrophoretic
trace (in which smaller fragments have larger peak areas
compared with longer fragments) can be described by an exponential
decay function. Parameter P (probability of extension) can be
determined empirically by testing values for P such that a line fit to 
the corrected peak area (peak area/D) versus DNA fragment length, 
has a slope, s, that is minimized. (B) Fragment length is an imperfect 
reflection of nucleotide position. Fragment length does not perfectly 
correlate with nucleotide position; their difference is depicted (grey 
line). This changing offset is the result of band compression. FAST 
algorithmically calculates this changing offset (grey line), using the ob- 
servation that the offset can only change (delta-delta, black filled 
circles) in an incremental fashion. This allows for proper 
peak-to-nucleotide position correlation (peak identification). 
NP = nucleotide position; FL = fragment length. See text for further 
discussion. 



fashion. Successive deltas are always quite similar, such
that delta-delta [where $\Delta\Delta_i = (\mathrm{NP}_i - \mathrm{FL}_i) - (\mathrm{NP}_{i-1} - \mathrm{FL}_{i-1})$,
and i is the specific nucleotide being assigned] is
small (<~1.0) (Figure 2B). It also follows that each individual
Δ(NP-FL) does not deviate significantly from the
average of all Δ(NP-FL) values, <Δ(NP-FL)>.
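
The delta and delta-delta quantities defined above are simple to compute from a calibration table. The sketch below is illustrative only: the function names and the example tables are hypothetical, and the threshold of 1.0 is taken from the approximate bound quoted in the text.

```python
def offsets(table):
    """Delta(NP - FL) for each (fragment_length, nucleotide_position) row."""
    return [nt - fl for fl, nt in table]

def delta_deltas(table):
    """Difference between each offset and the offset of the previous row."""
    d = offsets(table)
    return [d[i] - d[i - 1] for i in range(1, len(d))]

def suspicious_rows(table, threshold=1.0):
    """Indices of rows whose delta-delta magnitude exceeds the threshold."""
    return [i for i, dd in enumerate(delta_deltas(table), start=1)
            if abs(dd) > threshold]

# A plausible table, and the same table after one peak has gone missing.
good_table = [(71.6, 76), (80.4, 84), (81.6, 85)]
bad_table = [(71.6, 76), (81.6, 84)]
print(suspicious_rows(good_table), suspicious_rows(bad_table))
```

In the second table the offset jumps from about 4.4 to 2.4, a delta-delta of about -2.0, so that row is reported as suspicious while the first table passes cleanly.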

These observations allow FAST to analyse the gross 
alignment previously generated, and make adjustments 
that minimize the delta-delta value for assignments. This 
type of local correction is illustrated in Figure 3B. If we 
imagine that the peak highlighted in light orange (80.4) in 
Figure 3. Simulation of peak identification algorithm. (A) Example
calibration table of fragment lengths with their true nucleotide pos- 
itions, assigned based on a gross alignment (see text). (B) The peak 
found at 80.4 is deleted, to simulate a missing peak or oversaturated 
peak at this location. This results in the mistaken assignment of NT 
position 84 to the peak at 81.6. At the same time, this causes the delta- 
delta in column 4 to become large, signifying to the algorithm that a 
mistake in the calibration table has been made. (C) The algorithm 
attempts local reassignments, in order to minimize the delta-delta, re- 
sulting in the correct re-assignment of nucleotide 85 to the peak with 
fragment length 81.6. 



Figure 3A is missed (because of high background or
a low level of termination at that location), the peak
at 81.6 (highlighted in green) is incorrectly assigned
during the gross alignment to nucleotide 84. This also
causes the absolute value of the ΔΔ in this local area
to become high (bolded red text). FAST corrects this by
attempting to find the peak
most likely to correspond to unassigned nucleotide 85
(light purple), using the same heuristics as for the gross
alignment. FAST then checks whether this reassignment
(nucleotide 85 to peak 81.6) reduces the ΔΔ values
relative to the original alignment (Figure 3C). This
process is then repeated for nucleotide 84, which is now
newly unassigned. In this case, however, the
nearest peak with a smaller fragment length is 71.6, and
the difference between this fragment length and the
nucleotide position (84 - 71.6 > 5.5) is too large
for a reasonable assignment; this terminates the local
re-assignment process for this region.

FAST also performs a linear regression fit of fragment 
length to nucleotide position, and determines the residual 
for each peak. When this residual is high, the calibration 
file is flagged for inspection by the words 'REVIEW THIS 
REGION'. These assignments are beyond the ability of 
the algorithm, and user inspection is requested. Usually, 
assignments are correct, but can be manually adjusted 
within the program as needed. 
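
The residual check described above might look like the following sketch. It is not the FAST code: the function names are hypothetical, nucleotide position is regressed on fragment length as one reading of the text, and the residual cut-off of 2.0 nt is an assumption because the source does not state the value used.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of ys versus xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def review_flags(table, max_residual=2.0):
    """Annotate each (fragment_length, nucleotide_position) row with its
    residual from the linear fit and a flag when the residual is large."""
    lengths = [fl for fl, _ in table]
    positions = [nt for _, nt in table]
    slope, intercept = fit_line(lengths, positions)
    rows = []
    for fl, nt in table:
        residual = nt - (slope * fl + intercept)
        flag = "REVIEW THIS REGION" if abs(residual) > max_residual else ""
        rows.append((fl, nt, residual, flag))
    return rows
```

Because a single global line is fitted, this check complements the local delta-delta test: it catches assignments that drift away from the overall trend even when neighbouring offsets look consistent.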

FAST then uses the Local Southern method (16) along 
with this calibration table to convert fragment length to 
nucleotide numbers for each peak in the experimental and 
control data of interest. We have observed the Local 
Southern method to be quite robust. We have not found 
it necessary for the calibration table to capture all ddGTP 
termination events (peaks), which allows for specificity to 
be emphasized over sensitivity. For this reason, assign- 
ments are attempted by default for only the top 25% 
(by area) of peaks in a ddGTP trace, although this 
number can be varied by the user. Furthermore, because 
the fit is local, locally incorrect assignments do not affect 
assignments elsewhere. 

The final output of the FAST module is the same as for 
SHAPEfinder: a SHAPE file that lists nucleotide positions 
and their SHAPE reactivities. This file is read into 
RNAstructure as a forced pseudo-free energy constraint, 
resulting in a proposed RNA secondary structure. 

FAST has been optimized to work specifically with 
SHAPE data. For example, FAST interprets values that 
are higher than control values as meaningful and values 
that are lower than control values as meaningless/noise. 
Thus, while the algorithms described here are, in theory, 
agnostic to the probing method used, recoding would 
likely be necessary for its use with other techniques. For 
example, FAST cannot readily be used to analyse 
hydroxyl radical footprinting data, as control values 
(unfolded state) for such experiments are higher in value 
than experimental values (folded state) (17,18). 
Conversely, other chemical probing techniques, such as 
those that use DMS or CMCT (17,19) and result in data 
more similar to that of SHAPE, may be compatible with 
FAST analysis without much modification, if any. 

Reproducibility 

To assess the precision of FAST, we performed a series of 
analyses. Run-to-run variability in calculated SHAPE
reactivity (inter-experimental error) is plotted in 
Figure 4A. The same primer was used to initiate RT on 






[Figure 4 plots. (A) Run-to-run variability (same primer): SHAPE reactivity versus SHAPE reactivity, Pearson's coefficient (r) = 0.9665. (B) Run-to-run variability (two different primers): SHAPE reactivity versus SHAPE reactivity, Pearson's coefficient (r) = 0.9067.]

Figure 4. Reproducibility of FAST. (A) Two separate SHAPE experi- 
ments analyzed by FAST, in which the same primer was used during 
RT. The resultant SHAPE reactivities are plotted against one another 
and demonstrate the strong reproducibility of the FAST method to 
accurately assign NT position, correct for signal decay, and adjust 
for loading differences. See also Figure 1. (B) Two separate SHAPE 
experiments, in which two different but nearby primers were used 
during RT. The resultant SHAPE reactivities are plotted against one 
another. The observed correlation argues against the possibility that
the peak identification algorithm is specific to one primer and suggests
instead that it is a general means of correlating peaks with NT positions.



the same region of RNA, from two different 
NMIA-probed RNA samples. A high correlation 
(Pearson's coefficient = 0.97) was observed. Run-to-run 
variability was also evaluated for an entirely different 
RNA, the hepatitis delta virus, and found to be similarly 
small. Next, we assessed the accuracy of peak-to- 
nucleotide assignment by using two different primers, 
located near each other, and determined the SHAPE reactivity
of the same HCV RNA region (Figure 4B). The
SHAPE reactivity patterns for this region are in registry, 
suggesting that FAST algorithms have not been tuned 
only to one specific region of a given RNA or one 
specific primer. 



Additionally, we tested the reproducibility of FAST by 
determining the SHAPE reactivity of the same HCV 
region (nucleotide 29 through 112) in six independent ex- 
periments. Using these data, we calculated the standard 
deviation in SHAPE values for every position in this 
region. The average of these standard deviations was 
determined to be 0.04 SHAPE units (scale 0-1.5). The 
range (high and low) for every position was also 
calculated. The average of these ranges was determined 
to be 0.11 SHAPE units (scale 0-1.5). Thus, in the 
worst-case scenario of the highest versus lowest SHAPE 
reactivity determined for a given position, the measurement
varies, on average, by 0.11 SHAPE units (see
Supplementary Data Figure S1).
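
The two summary statistics used here (the average per-position standard deviation and the average per-position range) can be computed as in the sketch below. The function name and input layout are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was applied.

```python
from statistics import mean, stdev

def reproducibility_summary(replicates):
    """Summarize replicate SHAPE profiles.

    replicates: list of equal-length lists, one list of SHAPE reactivities
    per independent experiment, covering the same nucleotide positions.
    Returns (mean per-position SD, mean per-position range).
    """
    per_position = list(zip(*replicates))  # regroup values by nucleotide
    sds = [stdev(values) for values in per_position]
    ranges = [max(values) - min(values) for values in per_position]
    return mean(sds), mean(ranges)
```

The range is the more pessimistic of the two numbers, since it compares only the highest and lowest value observed at each position.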

Examples 

We have used this SHAPE/FAST system in a number of 
contexts. We have recently determined the structure of the 
genotype 1b/con1 5'UTR and found it to be consistent
with all available NMR and crystallographic information.
We have also used it to determine the structure of the
complex formed between miRNA-122 and its second
target site in HCV (Pang et al., submitted for publication). 

To further demonstrate the utility of this approach, we 
have also used it to (i) identify the binding site of a small 
molecule to domain II of the HCV IRES (Figure 5) and 
(ii) propose a structure for the core gene of HCV 
(Figure 6). 

Small ligand binding. Domain II of the HCV IRES is important
for viral translation (20). A small molecule,
known as ISIS-11 (21), has been found to bind to the
first bulge in domain II (22), known as IIa, altering its
conformation and inhibiting viral translation (23,24).
Given the long history of using chemical probing
methods to detect the binding of small molecules to ribosomal
RNA (25), we evaluated whether SHAPE/FAST
could be used to detect the binding of a small
molecule to domain II. We performed SHAPE experiments
in the presence of 0.5 nM to 200 µM of an ISIS-11
derivative, S-1. We then calculated the Pearson correlation
coefficient for each nucleotide in domain II as a function
of the log concentration of S-1. In this manner, we
identified 5 nt whose conformation was significantly
altered by the presence of S-1 (Figure 5A). Each of these
nucleotides is found near the binding pocket for the S-1
compound (Figure 5B), suggesting that SHAPE can be
used to characterize and potentially identify the binding
site of small molecules to viral RNA.
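
The per-nucleotide correlation against log concentration can be sketched as below. The function names and the input layout (a dictionary of reactivities keyed by nucleotide number) are hypothetical, and returning 0.0 for a nucleotide whose reactivity never changes is a convention chosen for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def dose_response_correlations(concentrations_molar, reactivities_by_nt):
    """Correlate SHAPE reactivity with log10(ligand concentration),
    separately for each nucleotide position."""
    log_conc = [math.log10(c) for c in concentrations_molar]
    return {nt: pearson(log_conc, values)
            for nt, values in reactivities_by_nt.items()}
```

A coefficient near +1 or -1 marks a nucleotide whose reactivity rises or falls steadily across the titration, while values near zero indicate positions that do not respond to the ligand.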

RNA secondary structure of HCV core gene. The core 
gene of HCV has been postulated to contain a number 
of RNA structures by both computational and enzymatic 
probing (26-28). How these structures fit together in the 
context of the entire core gene, and whether other struc- 
tures are present in this gene, is unknown. Additionally, it 
was shown that the core gene contains elements that allow 
for translation to initiate downstream of the classic HCV 
IRES start site (29). The core gene is also controversially 
postulated to contain an alternate reading frame, 
producing a protein product named F (30-32). Using the 
Figure 5. Effect of S-1 compound on SHAPE reactivity of nucleotides in the HCV 5'UTR. (A) Varying concentrations of S-1, an ISIS-11 analogue,
were incubated with folded HCV RNA. The resultant SHAPE reactivities versus S-1 concentration were examined using Pearson's correlation
coefficient (σ). An adjusted Pearson's score (σ^4 sqrt(RANGE)), for each NT position, which emphasizes strong correlations and incorporates the
magnitude of the change in SHAPE reactivity versus nucleotide position, is plotted. Five nucleotides in the HCV 5'UTR were identified as affected
by the presence of compound S-1. (B) These 5 nt map to the domain IIa bulge of the HCV 5'UTR, where it has previously been shown that analogue
compound ISIS-11 binds, distorting the IIa bulge (22).



SHAPE/FAST system described here, we have derived a 
proposed RNA secondary structure of the entire core gene 
of HCV (Figure 6). The RNA stem-loop structures in the 
core gene have been labelled A-J. This structure reveals a 
number of new putative structures (C-F, H and J), and is 
notably consistent with previously determined RNA struc- 
tures A, B and I. The nucleotides in the region of structure 
G have been previously postulated to form a structure 
known as SL248, which was enzymatically probed in iso- 
lation (28). Our structure is largely consistent with this 
structure, but differs in its bottom half. We postulate 
that this difference is due to the presence of other RNA 
elements of the core gene that were absent from the construct 
originally used to probe SL248. Stem-loop G also contains the 
postulated initiation codon for mini-core, an alternate site 
of in-frame translation initiation. 



As has been previously observed, the core gene con- 
tains a number of codons whose third positions are abso- 
lutely conserved (26,30) (Figure 7, bottom right), 
suggesting either an RNA structure of interest or, as pre- 
viously mentioned, an alternate reading frame protein. 
It is intriguing that the region between domain IV of 
the HCV IRES and Stem-Loop (SL)-A/domain V/SL47 
contains such a high density of codons whose third 
positions are absolutely conserved, given that this region 
has high SHAPE reactivity and is expected to be unstruc- 
tured. The absence of an RNA structure in this region 
favours the possibility of an alternate reading frame 
protein. 

An obvious difference in average SHAPE reactivity was 
observed for the non-coding versus the coding region of 
HIV, with the latter having significantly higher overall 
Figure 6. Postulated RNA secondary structure of the first ~900 nt of the HCV genome, based on a SHAPE/FAST analysis. RNA structures in the 
core gene are labelled A-J. The HCV IRES (domains II-IV) is accurately determined, along with previously postulated RNA structures within the 
core gene labelled A, B, G and I. Six novel RNA helical/stem-loop structures were also determined, labelled C, D, E, F, H and J. Structure G 
(SL248) contains codon 91, where another internal initiation of translation has been proposed to occur. The region between domains IV and V 
(Structure A) is unstructured and proposed to encode an alternative reading frame protein. 



reactivity (6). In contrast, we observe that while the HCV 
IRES region has the lowest average SHAPE reactivity, the 
core region has some areas of high reactivity, but some 
areas of notably low reactivity as well. This is consistent 
with the observations by Simmonds et al. (33) that HCV 
contains significant RNA structure throughout its coding 
region. Intriguingly, the region near codon 91, where 
internal initiation is also thought to occur (29), defines a 
second region of high RNA structure density and low SHAPE 
reactivity. Whether these structures form a second IRES 
remains speculative (34). Mutagenesis studies will be 
needed, first to confirm the RNA secondary structure 
shown here and then to assess any relevance it may have 
to internal initiation at codon 91. 
Figure 7. Average SHAPE reactivity (window size = 100 nt) as a function of nucleotide position. The HCV IRES has the lowest SHAPE reactivity. 
The region after the AUG start site is unstructured yet possesses a high density of codons whose third position is conserved. Codon 91, the putative 
site of internal core translation initiation, is indicated and represents the region with the second lowest SHAPE reactivity. 
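
A windowed average of the kind plotted in Figure 7 can be sketched as follows. This is an illustrative sketch rather than the code used to produce the figure: the function name and the synthetic reactivity profile are hypothetical, and the choice of a centred window that is truncated at the ends of the RNA is an assumption, since the text specifies only the window size.

```python
def windowed_mean(reactivities, window=100):
    """Moving average of per-nucleotide reactivities.

    ASSUMPTION: the window is centred on each position and is
    truncated (not padded) where it runs past either end.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(reactivities)
    half = window // 2

    # Prefix sums let each window total be read off in constant time.
    prefix = [0.0]
    for value in reactivities:
        prefix.append(prefix[-1] + value)

    averages = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)
        averages.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return averages


# Synthetic profile for illustration only: a structured (low
# reactivity) block, a flexible block, then a second structured block.
profile = [0.1] * 100 + [0.9] * 100 + [0.2] * 100
smoothed = windowed_mean(profile, window=100)
```

Regions of low averaged reactivity in such a profile correspond to stretches with a high density of paired nucleotides, which is how the IRES and the region near codon 91 stand out in Figure 7.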



In summary, we present here a novel, rapid method for 
determining RNA secondary structure, based on SHAPE 
technology. We have used this technology to map the 
binding of a small molecule to the HCV IRES, as well as 
to postulate for the first time the structure of the first 
~900 nt of HCV, including the core gene. The software 
for RNAstructure with FAST can be found at 
http://glennlab.stanford.edu. The source code for this software 
is available upon request. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 